knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(Seurat)
# library(plotly)
#install.packages("future")
# library(future)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
# library(gridExtra)
# library(ggrepel)
library(ggrepel)
load("../data/pre_integrated_SeurObj.RData")

setwd("/Volumes/easystore/SIMR_2019/pio-lab/scripts")

Objective:

Week of April 6th

Create Seurat Object for 10X homeo data, preprocesssing, QC filtering

nFeature_RNA is the number of genes detected in each cell.

nCount_RNA is the total number of molecules detected within a cell.

Low nFeature_RNA for a cell indicates that it may be dead/dying or an empty droplet.

High nCount_RNA and/or nFeature_RNA indicates that the “cell” may in fact be a doublet (or multiplet).

nFeature.vln.homeo.isl1.sib.10X + nCount.vln.homeo.isl1.sib.10X + pct.mito.vln.homeo.isl1.sib.10X

featurescatter.nCountXpercent.mt.homeo.isl1.sib.10X + featurescatter.nCountXnFeature.homeo.isl1.sib.10X

Filter cutoff: #subsetting parameters: omit cells with genes less than 3000 and greater than 200. omit mitochondrial contamination greater than 10%. omit cells with molecules greater than 10,000

Noticable increase in cells/samples from 2979 to ~16000 using an alternative subsetting parameter: homeo.isl1_sib_10X <- subset(homeo.isl1_sib_10X, subset = nFeature_RNA > 200 & nFeature_RNA < 3000 & percent.mt < 10 & nCount_RNA <10000)

homeo.isl1_sib_10X <- subset(homeo.isl1_sib_10X, subset = nFeature_RNA > 200 & nFeature_RNA < 3000 & percent.mt < 10) will give 2979 cells, which is way to conservative, marker genes will not be expressed with this subset

homeo.isl1_sib_10X
## An object of class Seurat 
## 25622 features across 16607 samples within 1 assay 
## Active assay: RNA (25622 features)
##  2 dimensional reductions calculated: pca, umap

Find Variable Features

VariableFeature.labedlplot2.homeo.isl1_sib_10X

Error in getGlobalsAndPackages(expr, envir = envir, globals = globals) : The total size of the 11 globals that need to be exported for the future expression (‘FUN()’) is 3.18 GiB. This exceeds the maximum allowed size of 0.98 GiB (option ‘future.globals.maxSize’). The three largest globals are ‘object’ (3.17 GiB of class ‘numeric’), ‘features’ (1.67 MiB of class ‘character’) and ‘split.cells’ (1.39 MiB of class ‘list’).

Formerally using set future.global to 1G, change to 3.5G however activit usage on local mac is using 6G/8G. If R session aborts, will have to resort to “sequential” and fargo parallelization.

Parallelization seems to force abort R session, switch to sequential process from now on.

Choosing PC

ElbowPlot(homeo.isl1_sib_10X, ndims = 50)

ElbowPlot(homeo.isl1_sib_10X, ndims = 30)

ElbowPlot(homeo.isl1_sib_10X, ndims = 20)

Scree plot shows that variance/eigenvalues in data seem to level off between 15-20 PCs. We’ll get the most information within the first ~15 PCs. Use jackstraw to confirm statistical power.

JackStrawPlot.PC20.homeo.isl1_sib_10X

Showing less statistical significance past >18 PC. Will set PC to 15

PC 15 with resolution 1.2, 14 clusters: gave pretty neat cluster, some single cells outside their prespective groupings however no striking outliers

umap.unlabeled

24 Clusters generated

Annotation

List of marker genes was taken from Lush (2019) publications

Functions created to help annotate and visualize marker genes:

####FeaturePlotToPNG#### Given list of marker genes, will print individual FeaturePlot and directs the plots to the specified directory

figure_dir <- "isl1_sib_counts_10X_figures"
FeaturePlotToPng <- function(marker.list, dir_name) {
  for (x in marker.list){
    #to.png <- FeaturePlot(homeo.isl1_sib_10X, features = x, label = TRUE) 
    print(x)
    mypath <- file.path("./", figure_dir, paste(x, ".png", sep = ""))
    print(mypath)
    png(file=mypath,width = 11, height = 9, units = 'in', res = 300)
    print(FeaturePlot(homeo.isl1_sib_10X, features = x, label = TRUE) )
    dev.off()}}

####VlnPlotToPNG Given list of marker genes, will print individual Vlnplot and directs the plots to the specified directory

VlnPlotToPng <- function(marker.list, dir_name) {
  for (x in marker.list){
    #to.png <- FeaturePlot(homeo.isl1_sib_10X, features = x, label = TRUE) 
    print(x)
    mypath <- file.path("./", figure_dir, paste(x, "_vlnplot.png", sep = ""))
    print(mypath)
    
    png(file=mypath,width = 11, height = 9, units = 'in', res = 300)
    print(VlnPlot(homeo.isl1_sib_10X, features = x, , pt.size = 0))
    dev.off()
  }
}

#####Top10GeneCellIdentityFunction##### This function will return a adjusted dataframe for the genes with the 10 (or less) highest avg_logFC and its corresponding cluster it will also generate a FeaturePlot and Violin Plot corresponding this function may be helpful as most canonical markers have low signal, this will discern what signal is highest to help identify clusters

Top10GeneCellIdentity <- function(marker.list, df){
  #This function will return a adjusted dataframe for the genes with the 10 (or less) highest avg_logFC
  #and its corresponding cluster
  #it will also generate a FeaturePlot and Violin Plot corresponding
  #this function may be helpful as most canonical markers have low signal, this will discern what signal is highest to help identify clusters
  top10.df <-filter(df, Gene.name.uniq == marker.list) %>% top_n(n = 10, wt = (avg_logFC)) %>% arrange(desc(avg_logFC))
  gene.list <- unique(top10.df$Gene.name.uniq)
  feature.plot <- FeaturePlot(homeo.isl1_sib_10X, features = gene.list, label = TRUE)
  vln.plt <- VlnPlot(homeo.isl1_sib_10X, features = gene.list, pt.size = 0)
  return(list(top10.df, feature.plot, vln.plt))
  dev.off()
}

More Robust Marker Gene

Use excel sheet for Lush (2019) suppl file 9 for list of hair cells in their young, mature, prog stages

Robust Marker Feature Plots

Robust Marker Feature Plots

Renamed Cluster Identity

Renamed Cluster Umap

Renamed Cluster Umap

Note: Tiny clusters with -pos suffix indicate markers that have a high expression for that cluster.

Use command line to search for the top 20-200 genes within that cluster to determine identity name. Although it may not be the first highly express, search for common gene, see mfap-pos cluster for example

filter(all.markers.homeo.isl1_sib_10X, cluster == x) %>% top_n(n = 20, wt = (avg_logFC)) %>% arrange(desc(avg_logFC))

Week April 13th

Reoccuring error message: Error: Feature names of counts matrix cannot be empty

1st method: load in seurat object with ensembl id and convert object[["RNA]]@counts to gene symbol manually however the error message will appear in IntegrateData()

2nd method: convert 10X homeo data to ensembl ID to match smartseq2 rownames error message: Error: vector memory exhausted (limit reached?) due to large memory size of 10X homeo data

attemped resolution: Step 1: Open terminal,

Step 2:

cd ~ touch .Renviron open .Renviron Step 3: Save the following as the first line of .Renviron:

R_MAX_VSIZE=8Gb

to increase RAM capacity on local computer

resolution: create a job on talapas virtual desktop

checking for empty cell in rownames(fpkm_matrix) #check if rownames is empty which(rownames(x = fpkm_expression_mtx) == '') #[1] 11478 print(rownames(fpkm_expression_mtx)[11478]) #[1] "" #delete new_fpkm_expression_mtx <- fpkm_expression_mtx[-11478,] #turn to sparse matrix fpkm_expression_mtx <- as.sparse(fpkm_expression_mtx)

#check dim(new_fpkm_expression_mtx) #[1] 32488 649

which(rownames(x = new_fpkm_expression_mtx) == '') #integer(0)

error resolved, can proceed with integration

fpkm smartseq umap Integrated Cluster Umap

Integration of data

Integrating data|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=57s Warning: Adding a command log without an assay associated with it

Integrated Cluster Umap

Integrated Cluster Umap